Skip to content

Conversation

@Frando
Copy link
Member

@Frando Frando commented Jun 27, 2025

Description

Adds a test that closes a client endpoint and recreates it right away with the same node id. It tries to connect to a server it was previously connected to to. The test fails on main.

The reason is that the server thinks the client's previous UDP send address is still valid. Even though it receives both disco pings and quic packets from the new address, it doesn't update the best_addr used for sending, therefore the server's QUIC replies are sent into nirvana, and the test times out.

The second commit fixes this behavior by always sending disco pings to new paths.

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

@github-actions
Copy link

github-actions bot commented Jun 27, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3372/docs/iroh/

Last updated: 2025-06-30T09:24:39Z

@github-actions
Copy link

github-actions bot commented Jun 27, 2025

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 38380b8

@n0bot n0bot bot added this to iroh Jun 27, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Jun 27, 2025
@Frando Frando force-pushed the Frando/test-abort-reconnect branch from 56c059c to 34ec2a2 Compare June 27, 2025 09:14
@Frando Frando changed the title tests(iroh): add test to reconnect after forceful abort tests(iroh): failing test to reconnect after endpoint recreation Jun 27, 2025
@Frando Frando changed the title tests(iroh): failing test to reconnect after endpoint recreation tests(iroh): failing test for reconnecting after endpoint recreation Jun 27, 2025
@Frando Frando force-pushed the Frando/test-abort-reconnect branch 2 times, most recently from fc031ea to 01a23ec Compare June 27, 2025 09:37
@Frando Frando changed the title tests(iroh): failing test for reconnecting after endpoint recreation fix(iroh): always ping new paths, add test for reconnecting with changed addrs Jun 27, 2025
@Frando Frando force-pushed the Frando/test-abort-reconnect branch from 01a23ec to 087f003 Compare June 27, 2025 09:44
Copy link
Contributor

@flub flub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good fix! I can't think of any reason why this extra ping back would ever do much harm (famous last words).

My only concern is the 5s is asking for flakyness, but I can't really think of something better right away.

.e()??;
assert_eq!(rx.recv().await.unwrap().unwrap(), 23);
// close the endpoint in a separate task, to not lose time for our immediate respawn testing
let close1 = tokio::task::spawn(async move { ep.close().await });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this take? Since both endpoints are still there this should be pretty fast, no? Feels a bit odd to spawn this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I'll remove it again. It was a trial for the windows failure, but that's not it, see comment below

@flub
Copy link
Contributor

flub commented Jun 27, 2025

My only concern is the 5s is asking for flakyness, but I can't really think of something better right away.

Oh, lol. Windows CI already fails because of this... Guess we'll have to figure this out right away.

@flub
Copy link
Contributor

flub commented Jun 27, 2025

I think you might be able to set this to a 30s timeout because without a relay that would still fail after 30s in the old version? You can verify that.

@flub
Copy link
Contributor

flub commented Jun 27, 2025

Alternatively you can mimic what #3350 does and measure the time it takes to connect initially. Then require the 2nd connection to succeed in 3 times that base time. Could be more robust?

@Frando
Copy link
Member Author

Frando commented Jun 27, 2025

The flakyness is not only due to slow CI I think, I got it to fail locally too from time to time. My current reasoning is that the fix applied here is not enough (pasting from discord):

so with the change from the PR, a new path is pinged. once we get the pong, we check if this path should be the new best_addr. which takes us to here:
https://github.com/n0-computer/iroh/blob/main/iroh/src/magicsock/node_map/best_addr.rs#L111
the new path will only become the best_addr (and thus the send addr for the next quic packet) if it has a lower latency than the current best_addr, or if the current best_addr exceeds its trust interval
in our test case, the previous addr is still trusted. the path is also physically identical, so it is purely random which of the two pingpongs had a lower latency
if the pingpong from the new client endpoint has a lower latency, the addr from the new client endpoint is used from then on. if the pingpong from the old client endpoint happened to have a lower latency, the best_addr is kept and future udp packets are sent into nirvana

@Frando Frando force-pushed the Frando/test-abort-reconnect branch from 0b10a19 to a00644a Compare June 27, 2025 11:47
@Frando
Copy link
Member Author

Frando commented Jun 30, 2025

I pushed a commit that makes the test pass reliably (locally for me at least):

a8ee257

It adds the following case to BestAddr::insert_if_better_or_reconfirm, which is called when a pong is received on a UDP path:

// If we receive a pong on a different port but the same IP address as the current best addr,
// we assume that the endpoint has rebound, and thus use the new port.
} else if state.addr.addr.ip() == addr.ip() {
    self.insert(addr, latency, source, confirmed_at)
}

So when we receive a pong from the same IP address as our current best_addr, but with a different port, we unconditionally switch our best addr to that new port, even if the latency is lower than the best addr's latency from the last pingpong.

So far this was the simplest I could come up with. Can you think of any scenarios where that would lead to undesired outcomes?

@Frando Frando requested a review from flub June 30, 2025 09:31
state.confirmed_at = confirmed_at;
state.trust_until = Some(source.trust_until(confirmed_at));
// If we receive a pong on a different port but the same IP address as the current best addr,
// we assume that the endpoint has rebound, and thus use the new port.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main issue with this logic is that I find it fairly arbitrary. It fixes an artificial test case, but it seems like a sub-set of the cases that need to be fixed. We're also not really sure we can send on this path at this point, but I'm not even sure if that's something we're always sure of anyway.

@flub
Copy link
Contributor

flub commented Jul 9, 2025

I wonder if the original version should still be merged here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

2 participants